Chinese word segmentation model using bootstrapping

نویسندگان

  • Baobao Chang
  • Mairgup Mansur
چکیده

We participate in the CIPS-SIGHAN2010 bake-off task of Chinese word segmentation. Unlike the previous bakeoff series, the purpose of the bakeoff 2010 is to test the crossdomain performance of Chinese segmentation model. This paper summarizes our approach and our bakeoff results. We mainly propose to use χ statistics to increase the OOV recall and use bootstrapping strategy to increase the overall F score. As the results shows, the approach proposed in the paper does help, both of the OOV recall and the overall F score are improved.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

Almost all Chinese language processing tasks involve word segmentation of the language input as their first steps, thus robust and reliable segmentation techniques are always required to make sure those tasks wellperformed. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) are often used in segmenting Chinese texts. Compared with traditional...

متن کامل

A Web-based Approach To Chinese Word Segmentation

Chinese text processing requires the detection of word boundaries. This is a non-trivial step because Chinese does not contain explicit whitespace between words. Existing word segmentation techniques make use of precompiled dictionaries and treebanks. The creation of dictionaries and treebanks is a labor-intensive process and consequently they are updated infrequently. Furthermore, due to their...

متن کامل

A New Approach for English-Chinese Named Entity Alignment

Traditional word alignment approaches cannot come up with satisfactory results for Named Entities. In this paper, we propose a novel approach using a maximum entropy model for named entity alignment. To ease the training of the maximum entropy model, bootstrapping is used to help supervised learning. Unlike previous work reported in the literature, our work conducts bilingual Named Entity align...

متن کامل

Chinese Word Segmentation Based On Direct Maximum Entropy Model

Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develop a Chinese lexical analyzer PCWS using direct maximum entropy model. The paper presents the general description of PCWS, as well as the result and analysis of its performance at the Second International Chinese Wor...

متن کامل

Extracting Pronunciation-translated Names from Chinese Texts using Bootstrapping Approach

Pronunciation-translated names (P-Names) bring more ambiguities to Chinese word segmentation and generic named entity recognition. As there are few annotated resources that can be used to develop a good P-Name extraction system, this paper presents a bootstrapping algorithm, called PN-Finder, to tackle this problem. Starting from a small set of P-Name characters and context cue-words, the algor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010